Preparations

Load the necessary libraries

library(car)       #for regression diagnostics
library(broom)     #for tidy output
library(ggfortify) #for model diagnostics
library(sjPlot)    #for outputs
library(knitr)     #for kable
library(effects)   #for partial effects plots
library(emmeans)   #for estimating marginal means
library(ggeffects) #for partial effects plots
library(MASS)      #for glm.nb
library(MuMIn)     #for AICc
library(DHARMa)    #for residual diagnostics plots
library(modelr)    #for auxillary modelling functions
library(performance) #for residuals diagnostics
library(see)         #for plotting residuals
library(patchwork)   #for grids of plots
library(tidyverse) #for data wrangling
theme_set(theme_classic())

Scenario

Loyn (1987) modeled the abundance of forest birds with six predictor variables (patch area, distance to nearest patch, distance to nearest larger patch, grazing intensity, altitude and years since the patch had been isolated).

Regent honeyeater

Format of loyn.csv data file

abund dist ldist area graze alt yr_isol
.. .. .. .. .. .. ..
abund Abundance of forest birds in patch- response variable
dist Distance to nearest patch - predictor variable
ldist Distance to nearest larger patch - predictor variable
area Size of the patch - predictor variable
graze Grazing intensity (1 to 5, representing light to heavy) - predictor variable
alt Altitude - predictor variable
yr_isol Number of years since the patch was isolated - predictor variable

The aim of the analysis is to investigate the effects of a range of predictors on the abundance of forest birds.

Read in the data

loyn = read_csv('../data/loyn.csv', trim_ws=TRUE) %>%
  janitor::clean_names()
## Parsed with column specification:
## cols(
##   ABUND = col_double(),
##   AREA = col_double(),
##   YR.ISOL = col_double(),
##   DIST = col_double(),
##   LDIST = col_double(),
##   GRAZE = col_double(),
##   ALT = col_double()
## )
glimpse(loyn)
## Rows: 56
## Columns: 7
## $ abund   <dbl> 5.3, 2.0, 1.5, 17.1, 13.8, 14.1, 3.8, 2.2, 3.3, 3.0, 27.6, 1.…
## $ area    <dbl> 0.1, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2…
## $ yr_isol <dbl> 1968, 1920, 1900, 1966, 1918, 1965, 1955, 1920, 1965, 1900, 1…
## $ dist    <dbl> 39, 234, 104, 66, 246, 234, 467, 284, 156, 311, 66, 93, 39, 4…
## $ ldist   <dbl> 39, 234, 311, 66, 246, 285, 467, 1829, 156, 571, 332, 93, 39,…
## $ graze   <dbl> 2, 5, 5, 3, 5, 3, 5, 5, 4, 5, 3, 5, 2, 1, 5, 5, 3, 3, 3, 2, 2…
## $ alt     <dbl> 160, 60, 140, 160, 140, 130, 90, 60, 130, 130, 210, 160, 210,…

Exploratory data analysis

Avoid multicollinearity (correlated variables CAN be included to soak up the variance but CANNOT be interpreted while other correlated variables are there!)

# pairs(loyn)
scatterplotMatrix(~abund + dist + ldist + area + graze + alt + yr_isol, 
                  data=loyn, diagonal = list(method = 'boxplot'),
                  regLine = list(col="red"))

glm(abund ~ ., data = loyn) %>% vif
##     area  yr_isol     dist    ldist    graze      alt 
## 1.337627 1.841657 1.227387 1.255028 2.307661 1.574537

Predictors must be transformed, as the family relates only to the response. Log area in particular is particularly skewed.

scatterplotMatrix(~abund + log(dist) + log(ldist) + log(area) + 
                    graze + alt + yr_isol, 
                  data=loyn, diagonal = list(method = 'boxplot'),
                  smooth=list(col.spread="green"),
                  regLine = list(col="red"))

glm(abund ~ log(dist) + log(ldist) + log(area) + 
                    graze + alt + yr_isol, data = loyn) %>% vif
##  log(dist) log(ldist)  log(area)      graze        alt    yr_isol 
##   1.654553   2.009749   1.911514   2.524814   1.467937   1.804769
# loyn <- loyn %>%
#   mutate(larea = log(area)) %>% 
#   dplyr::select(-area)

Graze has a high VIF, consider deleting. Also, should maybe be ordinal/categorical instead.

Fit the model

loyn <- loyn %>% 
  mutate(fgraze = as.factor(graze),
         l_dist = log(dist),
         l_ldist = log(ldist),
         l_area = log(area),
         sl_dist = scale(l_dist, scale=F),
         sl_ldist = scale(l_ldist, scale=F),
         sl_area = scale(l_area),
         salt = scale(alt, scale=F),
         syr_isol = scale(yr_isol, scale=F)
)
  
loyn.glm <- glm(abund ~ sl_dist + sl_ldist + sl_area + 
                    fgraze + salt + syr_isol,
                data = loyn,
                family = gaussian(link = "identity"))
vif(loyn.glm) # none above 3
##              GVIF Df GVIF^(1/(2*Df))
## sl_dist  1.907874  1        1.381258
## sl_ldist 2.228456  1        1.492801
## sl_area  2.201116  1        1.483616
## fgraze   7.925036  4        1.295314
## salt     1.597400  1        1.263883
## syr_isol 3.252591  1        1.803494

The rule is we don’t want values above 3 for VIF, otherwise, that variable is correlated to the others in a way that obscures its interpretation.

There’s also a function to easily compare scaled effect sizes of all variables rather than do so here. That way, we can still interpret the function with the appropriate units here, but also check effect sizes later on as well.

Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ log(\mu_i) = \boldsymbol{\beta} \bf{X_i} \]

where \(\boldsymbol{\beta}\) is a vector of effects parameters and \(\bf{X}\) is a model matrix representing the additive effects of the scaled versions of distance (ln), distance to the nearest large patch (ln), patch area (ln), grazing intensity, year of isolation and altitude on the abundance of forest birds.

Model validation

autoplot(loyn.glm, which = 1:6)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Residuals are ok, normality not great, but should be robust to this. Cook’s d are all low (less than 0.8).

loyn.resid <- simulateResiduals(loyn.glm, plot=T)

Homoscedasticity is ok too.

When you have multiple variables, need to also plot the residuals against each predictor variable too.

loyn$resid <- loyn.resid$scaledResiduals
loyn %>%
  ggplot(aes(x = l_dist, y = resid)) +
  geom_point() +
  geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

loyn %>%
  ggplot(aes(x = l_ldist, y = resid)) +
  geom_point() +
  geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

loyn %>%
  ggplot(aes(x = l_area, y = resid)) +
  geom_point() +
  geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

loyn %>%
  ggplot(aes(x = alt, y = resid)) +
  geom_point() +
  geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

loyn %>%
  ggplot(aes(x = yr_isol, y = resid)) +
  geom_point() +
  geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Didn't have enough time to finish this last plot
# loyn %>%
#   
#   group_by(fgraze) %>%
#   summarise(mean = mean(resid),
#             se = sd(resid) / sqrt( n() ),
#             upr = gmodels::ci(resid)[3]) %>%
#   ggplot(aes(x = fgraze, y = mean)) +
#   geom_bar() +
#   geom_pointrange(aes(ymin = lwr, ymax = upr))

Partial plots

loyn.glm %>% ggemmeans(~sl_dist) %>% plot(add=T)

loyn.glm %>% ggemmeans(~sl_ldist) %>% plot(add=T)

loyn.glm %>% ggemmeans(~salt) %>% plot(add=T)

loyn.glm %>% ggemmeans(~syr_isol) %>% plot(add=T)

loyn.glm %>% ggemmeans(~sl_area + fgraze) %>% plot(add=T)

plot_model(loyn.glm, type = 'eff', show.data = T, dot.size = 0.5) %>%
  plot_grid
## Warning in plot_grid(.): Not enough tags labels in list. Using letters instead.

Note that if we had used log() and scale() in the glm function, plot_model would do the appropriate back-transform to the original scale, amazingly!

log-area is clearly important, as well as graze = 5.

plot(allEffects(loyn.glm, residuals = T), type = "response")
## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors

## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors

## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors

## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors

## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors

## Warning in Analyze.model(focal.predictors, mod, xlevels, default.levels, : the
## predictors sl_dist, sl_ldist, sl_area, salt, syr_isol are one-column matrices
## that were converted to vectors

Caterpillar plot

plot_model(loyn.glm,  type='est')

plot_model(loyn.glm,  type='est', transform='exp', show.values=TRUE)

Model investigation / hypothesis testing

summary(loyn.glm)
## 
## Call:
## glm(formula = abund ~ sl_dist + sl_ldist + sl_area + fgraze + 
##     salt + syr_isol, family = gaussian(link = "identity"), data = loyn)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -15.8992   -2.7245   -0.2772    2.7052   11.2811  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.47274    2.35785   9.531 1.83e-12 ***
## sl_dist       0.14456    1.19334   0.121   0.9041    
## sl_ldist      0.34641    0.92835   0.373   0.7107    
## sl_area       5.55123    1.22130   4.545 3.97e-05 ***
## fgraze2       0.52851    3.25221   0.163   0.8716    
## fgraze3       0.06601    2.95871   0.022   0.9823    
## fgraze4      -1.24877    3.19838  -0.390   0.6980    
## fgraze5     -12.47309    4.77827  -2.610   0.0122 *  
## salt          0.01070    0.02390   0.448   0.6565    
## syr_isol     -0.01277    0.05803  -0.220   0.8267    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 37.27021)
## 
##     Null deviance: 6337.9  on 55  degrees of freedom
## Residual deviance: 1714.4  on 46  degrees of freedom
## AIC: 372.52
## 
## Number of Fisher Scoring iterations: 2

Evidence of an effect of log-area and of grazing area 5 being less than grazing 1, while all other grazing levels were not different from grazing 1. Need to investigate pair-wise effects among the factor levels thereafter.

Further analyses

Loyn, R. H. 1987. “Nature Conservation: The Role of Remnants of Native Vegetation.” In, edited by D. A. Saunders, G. W. Arnold, A. A. Burbridge, and A. J. M. Hopkins. Chipping Norton, NSW: Surrey Beatty & Sons.